This series of notebooks will cover the practical application of so called state-space models towards understanding market behaviour as well as the underlying market microstructure and how this new information can help us design profitable algorithmic trading models. I will mention right away that a purely theoretical description, while necessary to understand the topic in full, will most likely be counterproductive as it is can be very math heavy; instead I will try to build an intuitive understanding of the subject via a practical treatment of various ideas and concepts encountered within. That being said, mathematics are sometimes unavoidable and I will need to define certain ideas mathematically first before further explanation can be given. Finally, at this stage, this notebook is largely experimental and many of the ideas presented here, while indeed very promising, will require more work and testing before we can make use of them in a live scenario.
Before we can approach the subject at large we should first cover some basic concepts and explain our motivation for state-space modelling.
A raw price series is a continues time series devoid of all other information but the price and volume of an asset, future, derivative or any other financial instrument. For illustrative purposes here is an hourly OHLC (open-high-low-close) chart of the BCH/BTC pair.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_finance import candlestick2_ochl
s = pd.read_csv('/home/lisper/fintech/data/btc_bch_1h.csv')
_, ax = plt.subplots(figsize=(15,  5))
candlestick2_ochl(ax, s['open'], s['close'], s['high'], s['low'], width=2)
plt.show()
We are interested in extracting additional information, importantly valuable information we can trade on, from this raw object.
Now, at first glance this seems like we are approaching the domain of Technical Analysis (TA). Note the capitalization, I'm not talking about technically assessing prices and price levels and so on, TA here means relying on so called "technical indicators" like the "RSI", "William's R%" and many, many other indicators which are purely influenced by historic price and volume levels to understand and model market behaviour. Many, including myself, view this type of approach as pseudo-scientific at best largely because TA is mostly interpretive and expressions like "You just didn't use it properly" or "The market is just TA resistant right now" and "He's really good at TA" are rather common.
While TA can't be dismissed outright (we would need to prove beyond any doubt that all of the 10k or so indicators are no better than random first), the TA approach itself or rather how it is practiced currently, should be safely dismissed as it is not objective and relies on subjective interpretation of indicators. In other words, it is more (much more) art than science. Practitioner A and Practitioner B trading the same currency pairing and relying on the same indicators are likely to interpret the generated buy/sell signals in vastly different ways. Designing an algorithm which will rely on technical indicators will thus require a subjective heuristic of sorts, making personal bias unavoidable...
Beyond that, one wouldn't be wrong in assuming that TA is inefficient purely because it is in such wide practice. Most TA forums boast hundreds of thousands of subscribers, yet, and of course this is purely anecdotal but seems rational enough an assumption, the vast majority are unable to outperform the market. The usual retort to that is "Well, you just don't know how to use it well" which we of course covered above.
The Efficient Market Hypothesis (EMH) states that asset prices fully reflect all available information making it impossible to beat the market as no practitioner is able to either purchase undervalued assets or sell assets for inflated prices. Quoting from wikipedia
There are three variants of the hypothesis: "weak", "semi-strong", and "strong" form. The weak form of the EMH claims that prices on traded assets (e.g., stocks, bonds, or property) already reflect all past publicly available information. The semi-strong form of the EMH claims both that prices reflect all publicly available information and that prices instantly change to reflect new public information. The strong form of the EMH additionally claims that prices instantly reflect even hidden "insider" information.
Whatever view of the EMH is taken, it strongly implies that historic price information can not serve as a signal worth trading on as all of the information contained within is total and available to everyone else the moment it becomes available to us, thus it is arbitraged away by default.
The "Random Walk Hypothesis" views the evolution of a price series subject to a random walk distribution. A random walk is a stochastic mathematical process that describes a path consisting of a series of random steps within a given mathematical subspace. This implies the inability to explain current or future price behaviour based on historic price behaviour alone. Below is a random walk model of a fictional asset's closing prices. The stochastic process behind our asset's value is a simple calculation based on consecutive digits of $\Pi$: increase when even, decrease if odd.
from mpmath import mp
bp = 100.0
inc = 2.
mp.dps = 1000
closes = [bp]
for p in [inc if int(i) % 2 == 0 else -inc for i in str(mp.pi) if i.isdigit()]:
    bp += p
    closes.append(bp)
_, ax = plt.subplots(figsize=(15,  5))
plt.plot(closes)
plt.show()
Notice how the resulting series resembles the price behaviour of a real asset.
Both the EMH as well as the RWH assumptions of market behaviour are closely related and certainly worth a consideration at least.
However, is price behaviour truly random? More importantly, does EMH hold true no matter what?
Modelling under the random walk distribution took root in the late 60s. Lorie and Hamilton in their 1973 paper on distributions of stock returns first came to the conclusion that "price has no memory", if an investors technique is conditioned on some ad hoc price configuration, there will be little value added because a random walk stock will give us no profitable clues about future prices. It was Fama who artfully formed and expanded the notion of random walk into what he popularized as the efficient market hypothesis in the mid 1970s. Such was the brilliance of Fama's paper that within no time at all the idea of the efficient and perfect market was firmly implanted in the brains of most financial economists, academics and econometricians, so much so that any divergent view would be akin to career suicide. Empirical findings disproving the EMH assumption or at least poking several large holes in the theory existed however never saw the light of day in a rather hostile environment.
Yet, where there is a will there is usually a way and by the early 80s there came a large volume of formal literature that discovered inefficiencies that could lead to abnormal returns if rigorously applied. The list includes size effect, January effect, value irregularities, momentum effect, etc.
Like in most sciences the road towards getting published was of course to operate under the formal assumption of EMH and thus all irregular inefficiencies were called "anomalies" it was also acknowledged that the discoveries were (1) historic and likely not repeatable (now that we know them) (2) inconclusive because of potential "risk misspecification" (3) lacking proper allocation of costs when formulating strategies.
It really took a new and emerging field, the field of behavioural finance to successfully argue against EMH and establish the ineffective market as a viable model. Psychologists and behavioural scientists formulated such notions as "herd behaviour in financial practitioners", "irrational exuberance", "under- and overreaction", "biased self-attribution" and many, many others. These and similar models effectively explained real market behaviour like economic bubbles, heteroskedastic shock, price drift and so on.
By the late 90s the market was assumed inefficient and ineffective by default.
A few notable models came out towards the end of the millennium. Daniel, Hirschleifer, and Subrahmanyam (DHS) (1998) assume that investors are overconfident about their private information, and their overconfidence increases gradually with the arrival of public information with biased self-attribution.
Hong and Stein (HS) (1999) make two assumptions: (1) investors are bounded rational, meaning that they have limited intellectual capacity and that they are rational in processing only a small subset of the available information; (2) information diffuses slowly across the population.
Barberis, Shleifer, and Vishny (BSV) (1998) suggest that investors exhibit two biases in updating their prior beliefs with public information: Conservatism and Representativeness. Conservatism states that investors are slow to change their beliefs in the face of new evidence; while representativeness is a biased understanding of probabilities, that is assessing the probability of an event by finding a "similar known" event and assuming that the probabilities will be similar.
The findings of Hong and Stein in particular will be of significant importance to us throughout the rest of this notebook.
Modern quantitative finance operates under the assumption of market inefficiency, which is attributed to investors irrationality as explained by behavioural finance and behavioural econometrics. That being said, it is very important to not fall into that basket of irrationality ourselves when designing and executing strategies. While it can be shown that past price behaviour does impact future price behaviour it is important to quantify this assumption and it is equally as important to design an objective interpretation to model this type of behaviour. In other words, the TA approach is still ill-conceived and ill-fitted to model a perfectly rational or a perfectly irrational market...
As mentioned above, HS's findings are of particular interest as they carry over certain implications about the relation between past and future price behaviour. One notable anomaly encountered in almost all financial time series is that of momentum. Quoting wikipedia
In finance, momentum is the empirically observed tendency for rising asset prices to rise further, and falling prices to keep falling. For instance, it was shown that stocks with strong past performance continue to outperform stocks with poor past performance in the next period with an average excess return of about 1% per month.
Momentum is an anomaly because there is no rational (rational as viewed through the EMH lens) explanation why an asset's past performance would impact its future value. Momentum can be short lived and long lived, it can trend upwards or downwards. Momentum, in most cases, suffers from reversal.
HS view momentum as influenced by mostly two factors. They specify two bounded rational agents - news-watchers and momentum traders themselves. Both are risk-averse, and their interactions set future prices. On the one hand, news-watchers exhibit similar behaviour to a typical fundamental manager in practice, observe some private information and ignore information in past and current prices. On the other hand, momentum traders condition their forecasts only on past price changes, and their forecast method is simple. The slow diffusion of information among news-watchers induces underreaction in the short run. Underreaction leads to positively autocorrelated returns - momentum continues. Upon observing this predictable return pattern, momentum traders condition their forecast only on past price changes and arbitrage the profit opportunity. Arbitrage activity eventually leads to overreaction in the longrun, creating dislocation between price and fundamentals. The reversal of price back to fundamental is the source of momentum reversal.
In simpler terms, momentum is in part motivated by our limited capacity to react and process new information, as new information arises it reaches more and more participants in a diffused manner rather than all at once. Participants are also slow to react and act upon this new information. This leads to a certain lag in price behaviour. While under the EMH assumption any information is instantly reflected in the price of an asset, in reality the slow diffusion of information leads to a rather prolonged period of adjustment and reflection.
Momentum is also motivated by momentum traders themselves. For example, a TA practitioner is betting on exploiting this phenomenon by gauging for momentum and longing when the price goes up while shorting when the price goes down. He might rely on something like the MACD (moving average convergence divergence) indicator to find a good spot for entry and exit. This becomes a self fulfilling prophecy of sorts as more and more momentum traders jump in to arbitrage this opportunity. Momentum continues to build stronger and finally collapses as the assets price grows (or falls) too divergent from its fundamental value leading to momentum reversal.
Taking the above into account we are able to formulate our first belief and strategy. Here presented in brief:
Admittedly this is a rather uninvolved strategy, we simply bet on capturing momentum and reversal faster than the general market.
Before we can continue we should quantify some of our assumptions about the presence of momentum by performing various tests for serial correlation. Serial correlation is the relationship of a variable to itself over a temporal dimension. Strong serial correlation naturally implies that past values are indicative of present and future values.
First let us load our time series. Here's the yearly ETH/BTC pairing denominated in hours.
ethu = pd.read_csv('/home/lisper/fintech/data/eth_btc_usdt_1h.csv')
_, ax = plt.subplots(figsize=(15,  5))
candlestick2_ochl(ax, ethu['open'], ethu['close'], ethu['high'], ethu['low'], width=4)
area1 = plt.Rectangle((2200, 0), 200, 1, color='black', alpha=0.2)
area2 = plt.Rectangle((2850, 0), 330, 1, color='black', alpha=0.2)
area3 = plt.Rectangle((4450, 0), 500, 1, color='black', alpha=0.2)
area4 = plt.Rectangle((11690, 0), 850, 1, color='black', alpha=0.2)
area5 = plt.Rectangle((13300, 0), 850, 1, color='black', alpha=0.2)
area6 = plt.Rectangle((14300, 0), 850, 1, color='black', alpha=0.2)
ax.add_artist(area1)
ax.add_artist(area2)
ax.add_artist(area3)
ax.add_artist(area4)
ax.add_artist(area5)
ax.add_artist(area6)
plt.show()
Areas of increased volatility with potential correlation are highlighted above. Further assessment can be made by converting the series to percentage returns.
_, ax = plt.subplots(figsize=(15,  5))
plt.plot(ethu['close'].pct_change().dropna())
area1 = plt.Rectangle((1450, -1), 600, 2, color='black', alpha=0.2)
area2 = plt.Rectangle((2150, -1), 1200, 2, color='black', alpha=0.2)
area3 = plt.Rectangle((3490, -1), 2100, 2, color='black', alpha=0.2)
area4 = plt.Rectangle((8700, -1), 2100, 2, color='black', alpha=0.2)
area5 = plt.Rectangle((11000, -1), 2100, 2, color='black', alpha=0.2)
ax.add_artist(area1)
ax.add_artist(area2)
ax.add_artist(area3)
ax.add_artist(area4)
ax.add_artist(area5)
plt.show()
The graph above shows some of the volatility clusters (this effect is known as heteroskedasticity) even more clearly. There's a strong implication of serial correlation. However, we should really perform an objective test for serial correlation. We can do this by choosing a holding window and a lookback window, iterating over the closing prices at every hour and computing the correlation coefficient and its p-value. An increasing correlation value and a small p-value (< 0.05 or the 95th percentile, allowing us to dismiss the null-hypothesis) suggests strong serial correlation in the series.
import scipy.stats
def p_corr(df1, df2):
    corr = df1.corr(df2)
    N = np.sum(df1.notnull())
    t = corr*np.sqrt((N-2)/(1-corr**2))
    p = 1-scipy.stats.t.cdf(abs(t),N-2)  # one-tailed
    return corr, t, p
pd.options.display.float_format = '{:,.11f}'.format
res = []
for lookback in [1, 5, 10, 25, 60, 120, 250]:
    for holdhours in [1, 5, 10, 25, 60, 120, 250]:
        df_Close_lookback = ethu.close.shift(lookback)
        df_Close_holdhours = ethu.close.shift(-holdhours)
        ethu['ret_lag'] = (ethu.close-df_Close_lookback)/df_Close_lookback
        ethu['ret_fut'] = (df_Close_holdhours-ethu.close)/ethu.close
        dfc = ethu[['ret_lag','ret_fut']].dropna()
        idx = None
        if lookback >= holdhours: 
            idx = np.array(range(0,len(dfc.ret_lag), holdhours))
        else: 
            idx = np.array(range(0,len(dfc.ret_lag), lookback))
        dfc = dfc.iloc[idx]
        t, x, p = p_corr(dfc.ret_lag, dfc.ret_fut)
        res.append([lookback, holdhours, t, p])
res = pd.DataFrame(res,columns=['lookback','holdhours','correlation','p value'])
print(res[res['lookback'] >= 10])
Notice the progressively increasing correlation and the fairly small p-values. We can test this further by designing a very simple strategy based on these results. We pick an acceptable lookback and p-value pair, enter a long if the cumulative returns over the lookback window are positive (this indicates momentum) and close the position as soon as the hold period expires.
def calculateMaxDD(cumret):
    highwatermark=np.zeros(len(cumret))
    drawdown=np.zeros(len(cumret))
    drawdownduration=np.zeros(len(cumret))
    for t in range(1,len(cumret)):
        highwatermark[t]=np.max([highwatermark[t-1], cumret[t]])
        drawdown[t]=(1+cumret[t])/(1+highwatermark[t])-1
        if (drawdown[t]==0):
            drawdownduration[t]=0
        else:
            drawdownduration[t]=drawdownduration[t-1]+1
    return np.min(drawdown), np.max(drawdownduration)
def report(df,lookback,holdhours):
    longs = df.close > df.close.shift(lookback)
    shorts = df.close < df.close.shift(lookback)
    df['pos'] = 0.
    for h in range(holdhours):
       long_lag = longs.shift(h).fillna(False)
       short_lag = shorts.shift(h).fillna(False)
       df.loc[long_lag,'pos'] += 1
       df.loc[short_lag,'pos'] -= 1
    ret=(df.pos.shift(1)* (df.close-df.close.shift(1)) / df.close.shift(1)) / holdhours
    cumret=np.cumprod(1+ret)-1
    print('APR', ((np.prod(1.+ret))**(17500./len(ret)))-1)
    print('Sharpe', np.sqrt(17500.)*np.mean(ret)/np.std(ret))
    print('Drawdown', calculateMaxDD(np.array(cumret)))
    return cumret
cumret=report(ethu, lookback = 60, holdhours = 5)
_, ax = plt.subplots(figsize=(15,  5))
plt.plot(cumret)
plt.show()
Choosing a lookback window of 60 hours and a hold window of 5 hours (lowest p-value) nets an annual percentage rate of ~50%, not bad for such a terrible strategy! Of course the above is a rather simplistic approach to backtesting, for one we didn't take any fees into account and neither did we account for any slippage. A sharpe ratio of over 3 points is very optimistic. But nevertheless the above simply demonstrates the effects of correlation and momentum.
We can also visualize correlation directly by looking at correlogram plots. Below are two correlograms, the ETH/BTC price series to the left and a truly random normal series to the right. Notice the difference between the two plots.
from statsmodels.graphics.tsaplots import plot_acf
_, axarr = plt.subplots(1, 2, figsize=(20, 10))
plot_acf(ethu['close'], axarr[0], title='ETH/BTC pair autocorrelation')
plot_acf(np.random.normal(1, 1, 1000), ax=axarr[1], title='Random Normal distribution autocorrelation')
plt.show()
Once the presence of serial correlation has been established, we must next find a way to model it correctly. A perfect model is able to "explain away" the correlative random component of the time series. Lets imagine an artificial series which is dependent on time in such a way that the asset's price increases whenever the clock strikes midnight and decreases whenever the clock hits noon. An effective model is able to explain away the generative stochastic process behind the series (the clock in this example).
Another interpretation is that of signal and noise. We can look at a price distribution, for example the closing price series of an asset, as an observable, albeit noisy, signal. Our task is "smoothing" this signal, removing the noise so to say, to figure out the underlying process.
With the case of momentum we are really interested in the underlying trend. Rather than following every up and down, every bump and peak in a time series, we want to follow a clear trend, short- or longterm as early and as closely as possible.
Lets now look at some of the trend following algorithms available in the standard technical analysis toolbox. We will try to identify the underlying trend of the BCH/BTC pair. Below is a visualization of the hourly closing price series.
df = pd.read_csv('/home/lisper/fintech/data/btc_bch_1h.csv')
[o, h, l, c] = [x.reshape(521, ) for x in np.split(df[['open', 'high', 'low', 'close']].values, 4, 1)]
_, ax = plt.subplots(figsize=(10, 5))
area1 = plt.Rectangle((70, 0), 25, 1, color='black', alpha=0.2)
area2 = plt.Rectangle((105, 0), 20, 1, color='black', alpha=0.2)
area3 = plt.Rectangle((153, 0), 10, 1, color='black', alpha=0.2)
area4 = plt.Rectangle((240, 0), 25, 1, color='black', alpha=0.2)
area5 = plt.Rectangle((372, 0), 36, 1, color='black', alpha=0.2)
ax.add_artist(area1)
ax.add_artist(area2)
ax.add_artist(area3)
ax.add_artist(area4)
ax.add_artist(area5)
plt.plot(c)
plt.show()
Dismissing shorting opportunities for a moment, the highlighted areas above display some of the profitable longs via upward momentum trends.
The majority of all trend following indicators in use right now are based on four simple models:
$$x_t = \sum_{i=1}^p{\alpha_ix_{t-1} + w_i} \tag{Autoregressive Model of order p} \text{ } \text{where $w_t$ is white noise and $\alpha_i \in \mathbb{R}$}$$
Essentially the autoregressive model is simply an extension of the random walk (the random walk is defined as: $x_t = x_{t-1} + w_t$). It is a linear model where each new term $x_t$ depends linearly on previous terms $x_{t-1}$ with coefficients for each term $\alpha_i$ and a random noise component. The coefficients must be selected manually. Intuitively, the autoregressive model captures "shock" or "noise" through a regression process; each new term is the addition of noisy previous terms.
$$x_t = w_t + \sum_{i=1}^q{\beta_i w_{t-q}} \tag{Moving Average Model of order q} \text{ } \text{where $w_t$ is white noise and $\beta_i \in \mathbb{R}$}$$
The moving average model is very similar to the autoregressive model, only instead of being a combination of past time series values it is a combination of past "white noise" or "shock" values. In other words, it captures noise on the direct rather than indirect level.
Now the ARMA and ARIMA models are simply a combination of the previous two models with some additions. The ARMA model is able to capture noise directly (as an additive process like the MA model does) and indirectly through the addition of previous terms (like the AR model does). The ARIMA model is ARMA (direct and indirect capture of noise) with the addition of an integrated component; meaning the time series is integrated $d$ times by the model in order to de-noise it of seasonal trends, heteroskedastic clusters, etc before the regular ARMA model is applied.
Below is a visualization of various AR, MA, ARMA and ARIMA based indicators applied to our BCH/BTC series.
import talib
_, ax = plt.subplots(3, 2, figsize=(30, 25))
inds = [
    ('EMA', talib.EMA(c, 7)),
    ('DEMA', talib.DEMA(c, 7)),
    ('HT', talib.HT_TRENDLINE(c)),
    ('KAMA', talib.KAMA(c, timeperiod=7)),
    ('T3', talib.T3(c, timeperiod=7, vfactor=1)),
    ('VMA', talib.WMA(c, timeperiod=7))
]
for j in range(0, 3):
    for i in range(0, 2):
        name, ind = inds.pop()
        ax[j][i].plot(ind, label=name, color='r')
        ax[j][i].plot(c, label='Actual', linestyle=':')
        ax[j][i].legend()
plt.show()
Upon first inspection all indicators capture an underlying trend to various degrees of success. However, upon closer inspection a number of issues should become apparent. For one, as in the case of the T3, EMA, HT and KAMA indicators, lag is a persistent problem. The model does not react to the underlying change of trend in due time. DEMA on the other hand captures the trend in time, however the resulting "smooth" signal is itself quite noisy plagued by various bumps and peaks. Over- and undershooting or over- and underreaction are also a problem area.
Evaluating the results above we can simply conclude that our models are simply unable to correctly capture noise and explain it away via regression, moving averages and so on.
An issue with all the models described above is their treatment of noise at large. Noise is viewed in a somewhat static manner, it is clear that we must represent noise itself through some stochastic process and find a model able to adjust the representation of noise as it is processing it.
The general premise of a state space model is that we have a set of states that evolve in time (for example momentum at time $t_{i-1}$ somehow changes and evolves to momentum at time $t_i$), but our observations of these states contain statistical noise (such as market infrastructure noise, irrational behaviour by participants, etc.) and hence we are unable to ever observe the "true" state. The goal of the state space model is to infer information about the states (for example identifying a trend), given the observations, as new information arrives.
State space models are also very general and is possible to put all of the models evaluated above into a state space formulation thus making them more dynamic.
The state space model is given by the following relationship
$$\theta_t = G_t \theta_{t-1} + w_t \tag{State equation}$$ $$Y_t = F_t^T \theta_t + v_t \tag{Observation equation} \text{, where $F^T$ is the matrix transpose of $F$}$$
The state space model represents hidden states at time $t$: $\theta_t$ as a linear combination of previous states $\theta_{t-1}$ transformed by a time varying matrix $G_t$ as well as system noise $w_t$. Both of the transformation elements $G_t$, $w_t$, and how those elements change over time, can be specified by us. Similarly, the observations $y_t$ (that is, something we can directly measure, like the price level) are a linear combination of current, hidden state $\theta_t$ and an additional random variation, the measurement noise $v_t$.
Intuitively one can view the state space model as an approximator of hidden state under uncertainty. For example we can imagine a faulty tracking device that provides us with noisy, error filled measurements of a rocket's position and we are interested in its velocity at any given time. As we receive more and more inputs from our error prone measuring device we can estimate the true state, the actual velocity of the rocket, with greater and greater certainty, gradually explaining away all the noise in our system.
The Kalman filter is a state-space algorithm derived from the general state-space model relationship. It is most commonly found in aeronautics, guidance and navigation of various rockets, spacecraft trajectory analysis and various other engineering tasks. However, since we have established the price of an asset as a measurable signal of its underlying trend and momentum we can use the Kalman filter to help us model this relationship.
The kalman filter is defined through a series of equations
$$X_{t+1} = \Phi X_t + c_t + w_t \tag{State equation}$$ $$Y_t = HX_t + d_t + v_t \tag{Observation equation}$$ $$X_{t+1|t} = \Phi X_{t|t} + c_t \tag{Prediction}$$ $$X_{t+1} = X_{t+1|t} + K_{t+1}(Y_{t+1} - Y_{t+1|t}) \tag{Correction}$$ $$K_{t+1} = P_{t+1|t} H^T[HP_{t+1|t}H^T + R]^{-1} \tag{Kalman gain}$$
Hidden state is once again represented as a linear combination of previous states, transformed by matrix $\Phi$, system noise $w_t$ as well as an optional correction vector $c_t$. Similarly, observations are represented as the transform of current state via transformation matrix $H$, measurement noise $v_t$ as well as the addition of a measurement correction vector $d_t$. Once the filter is specified new states are drawn via the prediction equation, once a new measurement is available subsequent predictions can be corrected via the correction and kalman gain relationship.
As before the kalman filter can be used to specify all previous models in state-space form. It should also be noted that while the kalman filter is a linear-dynamic system it is still able to model and measure non-linear behaviour. This property makes the kalman filter especially attractive, as relying on non-linear models alone (such as neural-networks and other machine learning objects) introduces many complexities into the system, while a purely linear model can be easily debugged and dealt with.
Now it is worth remembering that the kalman filter itself is just an algorithm which allows us to make better estimations of hidden states when dealing with noisy, uncertain measurements. The model specification, in other words parameters $\Phi$, $v$, $H$, $w$ and so on are all up to us to define and specify. While this does give us a great degree of freedom, nothing comes for free and estimating initial parameters can be rather difficult. For example, the kalman equation asks us to provide an initial estimate of system noise at $t_0$: $w_0$, how this estimate is derived and defined is entirely up to us.
Practically speaking, to make effective use of the filter we must have an idea of the underlying behaviour of our system. At least, we should have an approximate understanding of what it is we are trying to model. To draw upon an example, if we were modelling the beheaviour of a B52 bomber and the effects of turbulent weather on its velocity we would probability specify the whole model, noise, etc using classical mechanics.
When it comes to financial modelling things are not as clear cut.
That being said, we can make use of other algorithms especially those popular in genetic programming like "Particle Swarm Optimization" to tune our filter and derive the whole model from available data itself.
I hope to have provided sufficient theoretical background to build intuition of the modelling task at hand. Next, we will consider, develop and backtest several dyanmic linear systems under the state-space framework. The rest of this notebook will discuss the application of these models, directly to spot trend and momentum as well as indirectly as a preprocessing step for various other models.
Since we are interested in exploiting momentum, we should test our models on a somewhat volatile but not too volatile series. Ideally we are interested in various price regimes: upswings, downswings as well as flat movement. For example, there would be little point in testing against a constantly uptrending pair like BTC/USDT since even a completely terrible model would probably end up locking in some profits. Similarly, we do want momentum, so theres no point in testing against a 95% downtrending market. For this reason, I choose the familiar BCH/BTC pairing from above. Below is the 5min OHLC chart, starting early AUG and ending late NOV.
import pandas as pd
import matplotlib.pyplot as plt
from mpl_finance import candlestick2_ochl
s = pd.read_csv('/home/lisper/fintech/data/btc_bch_5m.csv')
_, ax = plt.subplots(figsize=(20,  5))
candlestick2_ochl(ax, s['open'], s['close'], s['high'], s['low'], width=2)
plt.show()
Theres some huge volatility spikes at the start as well as towards the end of the series. A prolonged period of sideways movement is also present. This should give us plenty of data-points for a fairly well rounded test against several market conditions. One final word of caution applies. We should be aware about introducing inherent bias into all of our tests: the presence of momentum is already known to us! In a live scenario for example the presence of those volatility spikes found towards the end of the series would be entirely unknown to us, the sideways movement could continue for years to come. Now relatively speaking, this is a minor point as far as tests themselves are concerned. As long as we treat tests in a completely transparent fashion, making sure the algorithm has no way to "look into the future", etc we will effectively simulate a real scenario. However, we should be aware that momentum is not a given and market conditions can change drastically. In a real setting, to guard against running a momentum model in a non-momentum market, we would supplement our model with a regime-switch model, but more on that later.
Next, lets define a model benchmark. We are not really interested in profit and loss as such, because it is a rather simplistic view of performance. As I mentioned previously, in boom market conditions even the most terrible model will be profitable. Rather, we are interested in outperforming the base market. While traditionally we could just compare our model against an index benchmark (the S&P 500 for example) no such thing is currently available for crypto markets.
We could go ahead and construct our own bucket of base crypto currencies and test against that but this is probably overkill for our simple scenario. Instead, we will test against a "buy and hold" strategy. Here we buy the asset at time $t_0$ and never sell it. Starting balance is $100. Backtest is shown below.
! backtrader ../backtests/buy_and_hold.py ../backtest_graphs/buy_and_hold.png
from IPython.display import Image, display
from IPython.core.display import HTML
display(Image("/home/lisper/fintech/backtest_graphs/buy_and_hold.png"))
A buy and hold strategy yields a total return of \$165.31, a profit of \$65.31 is achieved.
Our first state space model will be the regular identify filter. The kalman identity filter simply models the price signal as its initial value + some stochastic system noise. In other words, no transformations of the signal beyond the linear addition of noise is performed. Going forward, I will abstract away some model implementation details inside state_space.py for sake of simplicity and clarity.
# Some utility functions
import numpy as np
import pandas as pd
def load_data():
    """
    Loads our BTC/BCH price series, returns the OHLC
    as a separate vector series.
    """
    df = pd.read_csv('/home/lisper/fintech/data/btc_bch_5m.csv')
    return [x.reshape(31145, ) for x in np.split(df[['open', 'high', 'low', 'close']].values, 4, 1)]
import numpy as np
from pykalman import KalmanFilter as KF
[_, _, _, c] = load_data()
F = np.array([1]) # transition matrix
H = np.array([1]) # observation matrix (just itself)
Q = np.eye(1) * 0.001 # simple identity matrix I + random noise
R = [1.0, 5.0, 10.0, 15.5] # Measurement noise e.g. how uncertain are we about the underlying "correctness" of the price
m = np.array([c[0]]) # first closing price as initial state mean
P = np.eye(1) # simple identity matrix I as initial state variance
Means = []
for i, r in enumerate(R):
    kf = KF(
        transition_matrices=F,
        observation_matrices=H,
        transition_covariance=Q,
        observation_covariance=r,
        initial_state_mean=m,
        initial_state_covariance=P
    )
    means, _ = kf.filter(c)
    Means.append((r, means))
_, ax = plt.subplots(2, 2, figsize=(30, 25))
for i in range(2):
    for j in range(2):
        (r, mean) = Means.pop()
        ax[j][i].plot(mean[:, 0], label='System noise: {:.2f}'.format(r), color='r')
        ax[j][i].plot(c, label='Actual', alpha=0.5, linestyle=':')
        ax[j][i].legend()
plt.show()
Above are the outputs of 4 simple identity models with gradually increasing levels of systems noise. Notice how the output signal (marked in red) becomes smoother as we increase the amount of system noise. We could continue increasing system noise, but at some point we risk information loss and lag. While things are beginning to take shape already, it is quite clear that this type of model is just too simple as the noise can't be adequately explained yet. The resulting signal is
Nevertheless, lets perform a backtest of model $R = 15.50$. The strategy remains fairly simple, treat model output as the true price signal, execute a trade every hour, long if cumulative returns over the past 2 hours are positive, otherwise close the position. And once again, starting balance is set to $100. We won't be performing any risk assessment of individual trades for now and will simply risk 70% of available cash on every long. It should be noted that most momentum strategies progressively increase cash allocation as momentum is building, this is of course independent of the actual alpha model; something we should keep in mind for further profit maximization.
Finally the backtest includes a 0.1% trading fee as well as an equivalent slippage percentage, fairly standard amounts for crypto trading.
! backtrader ../backtests/identity_filter.py ../backtest_graphs/identity_filter.png
from IPython.display import Image, display
from IPython.core.display import HTML
display(Image("/home/lisper/fintech/backtest_graphs/identity_filter.png"))
The model nets a $80.03 profit and appears to outperform a simple "buy and hold" strategy. However, results are a little inconclusive...
We successfully exploit momentum whenever volatility spikes, like in the beginning as well as towards the end of the price series, however we are simply doing too many trades during those volatility spikes as well as when the market is just moving sideways. Clearly our model does not produce a signal that is smooth enough. While we successfully identify big spikes, our model is too sensitive and we end up following minor bumps and peaks.
So far we have learned that an identity model with a single hidden state (the price itself) is a fairly poor representation of the underlying systems dynamic. Lets introduce an additional unobserved state: velocity. Like in classic mechanics we need a way to represent velocity as a derivative function (here as related to price). Representing velocity as the rate of change of price between individual lags, or mathematically $V = P_t - P_{t-1}$ seems appropriate enough.
In a state space setting we can represent both the price as well as its rate of change with the following transition and observation matrices: $$ \phi = \begin{bmatrix} 1 & \delta t \\ 0 & 1 \end{bmatrix} $$ $$ H = \begin{bmatrix} 1 & 0 \end{bmatrix} $$
Interestingly, this looks similar to a local linear trend model which takes the form of: $$ \phi = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} $$ $$ H = \begin{bmatrix} 1 & 0 \end{bmatrix} $$ and is a more generalized version of our velocity model above.
F = np.array([[1, 1],
              [0, 1]]) # transition matrix
H = np.array([1, 0]) # observation matrix
Q = np.eye(2) * 0.001 # system noise
R = 0.7 # Measurement noise
m = np.array([c[0], 0]) # closing pice and initial velocity set to 0
P = np.eye(2) # initial state covariance
kf = KF(
    transition_matrices=F,
    observation_matrices=H,
    transition_covariance=Q,
    observation_covariance=R,
    initial_state_mean=m,
    initial_state_covariance=P
)
means, _ = kf.filter(c)
_, ax = plt.subplots(1, 2, figsize=(25, 5))
ax[0].plot(means[:, 0], color='r', label='Signal')
ax[0].legend()
ax[1].plot(means[:, 1], color='b', label='Rate Of Change')
ax[1].legend()
plt.show()
Above are the two states of our model. Note that while the signal is still very much noisy we did manage to capture 'velocity' as an additional hidden state. Compare the graph on the right to volatility spikes on the left. The apparent correctness of both states tells us that this model is a richer one, able to capture more of the underlying behaviour of the system.
The problem of noise however remains unsolved... A clear shortcoming of our model stems from our initial noise estimates: $Q = I^2 * 0.001$ and $R = 0.7$, the values are picked almost at random, I simply followed a rule of thumb when defining state space systems, that is defining Gaussian noise so that $w_t, v_t \in \mathbb{R^n} | 0 < w_t, v_t < 1$. The estimate appears to be completely off. A better noise estimate could be derived analytically. For example, David Aronson in "Evidence Based Technical Analysis" (2007) notes that the High-Low price split captures roughly 70% of information and the rest can be attributed to noise. We could follow this assumption and represent measurement noise as the difference between the high and the low at any given lag $t_n$ times the information capture percentage, or mathematically $R = 0.7 * (H_t - L_t)$. Yet, this somehow looks awfully ad-hoc and specific... We really need a more general way to tune our filter.
One apparent disadvantage of the Kalman filter is the requirement of sane starting parameters. How these are derived is left to our imagination. Again, suppose we are modelling a purely mechanical system, we could estimate noise as related to acceleration, which has known, well defined properties. In our case we are somewhat at a loss. There's no clear, analytical way to define noise when dealing with financial data.
Remember that there is no requirement for perfect initial estimates, we only need some sanity, the rest will be 'learned' by the filter as new measurements arrive. Thus, a possible solution is treating the whole issue as an optimization problem. We need an additional algorithm that will take our filter and continuously reapply it to our data while changing starting parameters until the best 'fit' (the smoothest possible signal) is found.
Simple bruteforcing, or just iterating over all possible parameters, won't work here. Although we are only estimating two parameters for now, the underlying parameter space is so large that it could take decades to arrive at a possible solution. We need a minimization algorithm that is able to adjust its search space. Metropolis–Hastings or general sampling algorithms from the Markov Chain Monte Carlo (MCMC) family are potential candidates. Yet, genetic or evolutionary programming offers one of the best solutions to our problem.
We will use Particle Swarm Optimization (PSO) to tackle our parameter search problem. PSO, like many other genetic algorithms, was inspired by behaviour found in nature. Quoting Dr. Eberhart (1995)
PSO is motivated by the behaviour of bird flocks in finding food. Suppose a flock of birds want to find food, but they do not know where the food is before they find it. PSO uses a swarm of particles to simulate these birds. Each particle is a possible solution of the optimization problem and has a random initial position $x_i$ and velocity $v_i$. The objective function targeted to be optimized $f$ is used to evaluate each particle position’s fitness.
I have defined a PSO routine inside state_space.py which we will run next to tune the $Q$, $R$ as well as a scaling parameter $\delta$. Only 10% of the data is used in the process to avoid look-ahead bias for when we run the actual backtest.
import imp
imp.reload(state_space)
bounds = np.array([[5.5, 5.5, 0.1],
                   [0.1, 0.1, 0.000001]])
Q, R = state_space.global_pso(c, dims=3, bounds=bounds)
F = np.array([[1, 1],
              [0, 1]]) # transition matrix
H = np.array([1, 0]) # observation matrix
m = np.array([c[0], 0]) # closing pice and initial velocity set to 0
P = np.eye(2) # initial state covariance
kf = KF(
    transition_matrices=F,
    observation_matrices=H,
    transition_covariance=Q,
    observation_covariance=R,
    initial_state_mean=m,
    initial_state_covariance=P
)
means, _ = kf.filter(c)
_, ax = plt.subplots(1, 2, figsize=(25, 5))
ax[0].plot(means[:, 0], color='r', label='Signal')
ax[0].legend()
ax[1].plot(means[:, 1], color='b', label='Rate Of Change')
ax[1].legend()
plt.show()
Although still bumpy, we retrieve a much smoother signal (especially comparing "the rate of change" states) through the combination of a Kalman filter and PSO tuning. The backtest of this new model is attached below.
! backtrader ../backtests/llt.py ../backtest_graphs/llt.png
from IPython.display import Image, display
from IPython.core.display import HTML
display(Image("/home/lisper/fintech/backtest_graphs/llt.png"))
The model achieves a \$233.06 return for a total profit of \$133.06. This model clearly outperforms our previous implementations. Lets go over this benchmark in more detail. The clearest change is probably the far fewer number of trades executed. This should at least give some more credence to our belief that a smoother signal will allow us to identify an underlying trend more clearly. Indeed, this model appears to 'learn' the signal at least to some degree, fewer trades are executed in sideways regimes. Both volatility spikes, in the beginning as well as towards the end of the series, are still our most profitable spots, which again coincides with our ideas about momentum. Note however how we suffer some losses towards the end of the series. We appear to have some issues with correctly following the trend in time. The problem could also be related to how exactly we calculate the presence of a trend vs. the absence of a trend. We should focus on exploring both next.
The common stochastic model in state space is defined by a set of relationships between a polynomial of order $d$ and a seasonal model of order $n$ so that $$ \phi = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & -1 \\ 0 & 0 & 1 & 0 \end{bmatrix} $$ $$ H = \begin{bmatrix} 1 & 0 & 1 & 0 \end{bmatrix} $$ $$ R, Q = I^4 * v_t \sim \mathbb{N}(0, V) $$
This model is able to capture 'physical' components (like our previous model) of the underlying system, 'seasonal' components - in our case specifically autocorrelation between returns - as well as identify a common stochastic process. The significance of the latter becomes clear once we realize that modelling with constant components, as we did before, is inappropriate since price does not follow a clear, normal distribution. We can once again draw a parallel to modelling the motion of a physical object. Without a major external force acting on the object, we can be fairly sure that its trajectory will remain on track so to say. A missile in motion is not gonna suddenly jerk up and fly off into the stratosphere. At least this is a sane assumption to make, if this does indeed occur we can be fairly certain that something quite extraordinary took place and well, no model could help us there.
When it comes to modelling price however we can never be sure about trajectory change and especially the magnitude of this change. We only need to take one look at those volatility spikes found in our series to validate this theory. Sudden upswings and downswings, especially in crypto, are arguably the norm rather than the exception. Extraction of a common stochastic component can help us model and predict sudden changes in trajectory which should ultimately result in a more robust model.
To avoid overall verbosity I have defined a common stochastic model as well as its tuning procedure via PSO inside state_space.py
import state_space
kf = state_space.common_stochastic(c)
kf = state_space.tune_stochastic(kf)
means, _ = kf.filter(c)
_, ax = plt.subplots(1, 2, figsize=(25, 10))
ax[0].plot(c, alpha=0.5, label='Actual', linestyle=':')
ax[0].plot(means[:, 0], color='red', label='Smooth')
ax[0].legend()
ax[1].plot(means[:, 1], label='Hidden Levels')
ax[1].plot(means[:, 2], label='Hidden Seasonality')
ax[1].plot(means[:, 3], label='Hidden Stochastic Noise')
plt.ylim(-0.0001, 0.0001)
ax[1].legend()
plt.show()
! backtrader ../backtests/stochastic.py ../backtest_graphs/stochastic.png
from IPython.display import Image, display
from IPython.core.display import HTML
display(Image("/home/lisper/fintech/backtest_graphs/stochastic.png"))
This model more than doubles our previous winnings, a total return of \$462.55 and a profit of \$362.55 is achieved. As can be seen from the backtest graph, this is largely due to us avoiding big losses during volatility spikes and being able to take larger profits at the same time. This model is able to capture some stochastic factors previously unaccounted for by previous specifications. Next, we will atempt final de-noising through live error correction as well as deal with our trend spotting algorithm.
Recall that we try to identify the direction and presence of a trend/momentum by looking at past cumulative returns over the period of 2 hours. If past returns are positive we long since this is indicative of momentum buildup, we close open positions (sell) otherwise. Now, this is a rather uninformed and at the same time awfully specific way to look at it. For one, we have no idea whether that period of 2 hours is the right autocorrelative lag order. Previously, at the beginning of this notebook, we performed correlation tests over various time windows to identify the right lookback period. We could do the same here and we could even perform these tests while the model is trading and adjust our momentum lookback window accordingly.
A far larger problem is the specificity of this algorithm. As I've stated previously, ad-hoc implementations should be avoided. With our current algorithm, we have for example no way to specify sensitivity of the lookback window (how sensitive are we to the rate of change?), we also have no way to assess certainty of momentum (recall we used p-values, the smaller the better, previously in our correlation tests).
Thus, going forward we will use the Mann-Kendall test for trend defined as:
$$ \mathbb{S}_t^n = \sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}{sgn(y_{t-i} - y_{t-j})} \text{ where } \begin{cases} \mathbb{S} = 1, \text{perfectly positive trend} \\ \mathbb{S} = -1, \text{perfectly negative trend} \\ \mathbb{S} = 0, \text{no trend} \\ \end{cases} $$
The Mann-Kendall test is defined in state_space.py. Below is a quick demo for illustrative purposes.
import state_space
uptrend = np.arange(0, 500)
downtrend = np.array(list(reversed(uptrend)))
no_trend = np.array([250] * 500)
plt.plot(uptrend, label='Uptrend')
plt.plot(downtrend, label='Downtrend')
plt.plot(no_trend, label='No Trend')
plt.legend()
plt.show()
d, _, p_value, _ = state_space.mk_test(uptrend, alpha = 0.05) # alpha is the sensitivity paramter
print('First series is {:s}, p-value certainty: {:.2f}'.format(d, p_value))
d, _, p_value, _ = state_space.mk_test(downtrend, alpha = 0.05)
print('Second series is {:s}, p-value certainty: {:.2f}'.format(d, p_value))
d, _, p_value, _ = state_space.mk_test(no_trend, alpha = 0.05)
print('{:s} found in last series, p-value certainty: {:.2f}'.format(d, p_value))
The Kalman filter is an iterative approximator, meaning hidden states, noise representations and everything else inside the filter can be adjusted as new measurements are made available. So far we have performed error correction inside the filter via the regular predict/update loop as described by the Kalman filter equation. However, because of the simplicity and extensibility of the Kalman filter it is very easy to implement additional corrective steps.
Since we are only interested in past states of the filter (we apply the Mann-Kendall form on past returns to predict momentum) we can extend our previous model by supplying a correction matrix $c_t$ to the filter which is treated as an additional post-processing step of sorts. Here we should remember that, at the ground level, we are doing signals processing and that many of the principles commonly found in estimating information ratios, smoothness, distortion, etc of real physical signals can also be applied here. Below, I define our previous model with the addition of a corrective step which will be performed periodically. The corrective procedure involves estimation of power spectral density of the signal using Welch's formulation, a smoothing factor $\alpha$ is then applied based on estimations from previous steps.
The model is once again tuned via PSO. The model, corrective procedures as well the tuning are defined in state_space.py. Below is a test with various levels of corrective smoothing.
import state_space
Alpha = [100, 200, 300, 400]
_, ax = plt.subplots(2, 2, figsize=(25, 15))
for i in range(2):
    for j in range(2):
        alpha = Alpha.pop()
=        means = state_space.corrective_stochastic(c, alpha)
        ax[i][j].plot(means, color='r', label='Alpha factor {:d}'.format(alpha))
        ax[i][j].plot(c, alpha=0.5, linestyle=':', label='Actual')
        ax[i][j].legend()
plt.show()
We finally achieve a fairly decent level of signal clarity without any major information loss - the model still identifies significant peaks and dips correctly - the smoothing is also completely lagless, we correctly estimate the overall trend without lagging behind the actual price. More importantly, we can now control the level of smoothness at will, if we want a more aggressive model we can reduce alpha and the model will try to trade insignificant peaks, while increasing alpha will lead to a more conservative model. All of this could be controlled automatically and adjusted based on market conditions or another metric. Finally, we run two baktests with the Mann-Kendall window set to 24 hours, Mann-Kendall alpha at 0.05 and smoothing alpha of 200 and slightly reduce at 185.
! backtrader ../backtests/corrective_stochastic.py ../backtest_graphs/corrective_stochastic.png ../backtest_graphs/corrective_stochastic_2.png 
from IPython.display import Image, display
from IPython.core.display import HTML
display(Image("/home/lisper/fintech/backtest_graphs/corrective_stochastic.png"))
from IPython.display import Image, display
from IPython.core.display import HTML
display(Image("/home/lisper/fintech/backtest_graphs/corrective_stochastic_2.png"))
The backtests show that the corrective stochastic model as well as the addition of the Mann-Kendall test is the best setup so far. This model outperforms the base 'buy-and-hold' market almost by a factor of 6, netting a profit of \$520.19 (alpha 200) \$542.43 (alpha 185) with total returns exceeding \$600.00 for both alpha paramters and a capital growth rate in excess of 520%.